 non-dominant language


ShifCon: Enhancing Non-Dominant Language Capabilities with a Shift-based Contrastive Framework

arXiv.org Artificial Intelligence

Although fine-tuning Large Language Models (LLMs) with multilingual data can rapidly enhance their multilingual capabilities, the models still exhibit a performance gap between the dominant language (e.g., English) and non-dominant ones due to the imbalance of training data across languages. To further enhance the performance of non-dominant languages, we propose ShifCon, a Shift-based Contrastive framework that aligns the internal forward process of other languages toward that of the dominant one. Specifically, it shifts the representations of non-dominant languages into the dominant language subspace, allowing them to access relatively rich information encoded in the model parameters. The enriched representations are then shifted back into their original language subspace before generation. Moreover, we introduce a subspace distance metric to pinpoint the optimal layer area for shifting representations and employ multilingual contrastive learning to further enhance the alignment of representations within this area. Experiments demonstrate that our ShifCon framework significantly enhances the performance of non-dominant languages, particularly for low-resource ones. Further analysis offers extra insights to verify the effectiveness of ShifCon and propel future research.
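The shift-toward and shift-back steps described in this abstract can be illustrated with a toy sketch. It assumes, purely for illustration, that each language subspace is summarized by the mean of its hidden states and that shifting is a translation between those means; the abstract does not specify the actual operation, and the names `mean_vector` and `shift` as well as the numbers are hypothetical.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def shift(h, src_mean, tgt_mean):
    """Translate h from the region centred on src_mean to the one centred on tgt_mean.

    Assumption: a mean-difference translation stands in for the paper's shift.
    """
    return [x - s + t for x, s, t in zip(h, src_mean, tgt_mean)]


# Toy 3-dimensional hidden states (made-up numbers, not real model activations).
dominant_states = [[1.0, 2.0, 0.0], [3.0, 2.0, 2.0]]
non_dominant_states = [[-1.0, 0.0, 4.0], [1.0, 2.0, 6.0]]

mu_dom = mean_vector(dominant_states)
mu_non = mean_vector(non_dominant_states)

h = non_dominant_states[0]
h_shifted = shift(h, mu_non, mu_dom)           # toward the dominant subspace
h_restored = shift(h_shifted, mu_dom, mu_non)  # back before generation
```

In a real model the layers between the two shifts would transform the representation, so the shift back would not simply undo the first shift as it does in this toy example.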


Language Imbalance Driven Rewarding for Multilingual Self-improving

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have achieved state-of-the-art performance across numerous tasks. However, these advancements have predominantly benefited "first-class" languages such as English and Chinese, leaving many other languages underrepresented. This imbalance, while limiting broader applications, generates a natural preference ranking between languages, offering an opportunity to bootstrap the multilingual capabilities of LLMs in a self-improving manner. Thus, we propose Language Imbalance Driven Rewarding, where the inherent imbalance between dominant and non-dominant languages within LLMs is leveraged as a reward signal. Iterative DPO training demonstrates that this approach not only enhances LLM performance in non-dominant languages but also improves the dominant language's capacity, thereby yielding an iterative reward signal. Fine-tuning Meta-Llama-3-8B-Instruct over two iterations of this approach results in continuous improvements in multilingual performance across instruction-following and arithmetic reasoning tasks, evidenced by an average improvement of 7.46% win rate on the X-AlpacaEval leaderboard and 13.9% accuracy on the MGSM benchmark. This work serves as an initial exploration, paving the way for multilingual self-improvement of LLMs.

Large Language Models (LLMs) have revolutionized the field of Natural Language Processing (NLP) with superior performance across numerous tasks. However, existing studies show that due to the imbalance of pre-training and fine-tuning data across languages, existing LLMs have predominantly benefited a few "first-class" languages, particularly English and Chinese, thereby overlooking a wide range of other languages (Qin et al., 2024). Given that LLMs are used worldwide, such language imbalance presents significant risks for users who operate in less dominant languages (Deshpande et al., 2023). To this end, enhancing the multilingual performance of LLMs has gained increasing attention. Previous research predominantly frames this imbalance as an issue to be resolved, often addressing it through multilingual training and cross-lingual alignment.
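The abstract above mentions iterative DPO training on preferences induced by language imbalance. As a rough illustration, the standard per-example DPO objective can be written in a few lines. The sketch assumes that the "chosen" response is the one derived from the dominant language and the "rejected" one comes from the non-dominant language; that pairing is a reading of the abstract, not a detail it states. The function name, the `beta` default, and the log-probability values are hypothetical.

```python
import math


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Arguments are sequence log-probabilities under the policy (pi_*) and
    the frozen reference model (ref_*). beta scales the implicit reward.
    """
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# Policy identical to the reference: no preference has been learned yet.
baseline = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy now favours the chosen response relative to the reference.
improved = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

The loss falls as the policy raises the likelihood of the chosen response and lowers that of the rejected one, relative to the reference model.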


Large Language Models are Good Multi-lingual Learners: When LLMs Meet Cross-lingual Prompts

arXiv.org Artificial Intelligence

With the advent of Large Language Models (LLMs), generating rule-based data for real-world applications has become more accessible. Due to the inherent ambiguity of natural language and the complexity of rule sets, especially in long contexts, LLMs often struggle to follow all specified rules, frequently omitting at least one. To enhance the reasoning and understanding of LLMs on long and complex contexts, we propose a novel prompting strategy Multi-Lingual Prompt, namely MLPrompt, which automatically translates the error-prone rule that an LLM struggles to follow into another language, thus drawing greater attention to it. Experimental results on public datasets across various tasks have shown MLPrompt can outperform state-of-the-art prompting methods such as Chain of Thought, Tree of Thought, and Self-Consistency. Additionally, we introduce a framework integrating MLPrompt with an auto-checking mechanism for structured data generation, with a specific case study in text-to-MIP instances. Further, we extend the proposed framework for text-to-SQL to demonstrate its generation ability towards structured data synthesis.
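The core idea in this abstract, rendering one error-prone rule in another language while leaving the rest of the prompt unchanged, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function `build_prompt`, the prompt layout, and the example rules are hypothetical, and the translation is supplied by hand rather than produced automatically as the abstract describes.

```python
def build_prompt(task, rules, error_prone_index, translated_rule):
    """Assemble a prompt in which one rule is replaced by its translation.

    error_prone_index is the 0-based position of the rule the model tends
    to violate; translated_rule is that rule in another language.
    """
    lines = [task, "Rules:"]
    for i, rule in enumerate(rules):
        text = translated_rule if i == error_prone_index else rule
        lines.append(f"{i + 1}. {text}")
    return "\n".join(lines)


rules = [
    "Use at most five decision variables.",
    "Every variable must be an integer.",
    "State the objective before the constraints.",
]

prompt = build_prompt(
    task="Write a small optimization model for the scenario below.",
    rules=rules,
    error_prone_index=1,
    translated_rule="Chaque variable doit être un entier.",  # hand-written French
)
```

Only the rule at `error_prone_index` changes language; every other rule stays in the prompt's original language.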


A New AI Lexicon: Algolinguicism

#artificialintelligence

Which languages and language-users are prioritized by digital platforms? Speakers of non-dominant languages are disproportionately subject to algorithmic harms.¹ They confront content moderation algorithms that "only work in certain languages"² on platforms that structurally omit non-Western nations from governance considerations. I call this tendency algolinguicism -- a matrix of automated processes that minoritize language-users outside the Global North and obstruct their access to political participation. This essay addresses digital platforms as sites of algolinguicism.